{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "# Timeseries anomaly detection using an Autoencoder\n",
    "\n",
    "**Author:** [pavithrasv](https://github.com/pavithrasv)<br>\n",
    "**Date created:** 2020/05/31<br>\n",
    "**Last modified:** 2020/05/31<br>\n",
    "**Description:** Detect anomalies in a timeseries using an Autoencoder."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## Introduction\n",
    "\n",
    "This script demonstrates how you can use a reconstruction convolutional\n",
    "autoencoder model to detect anomalies in timeseries data.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## Setup\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from tensorflow import keras\n",
    "from tensorflow.keras import layers\n",
    "from datetime import datetime\n",
    "from matplotlib import pyplot as plt\n",
    "from matplotlib import dates as md\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## Load the data\n",
    "\n",
    "We will use the [Numenta Anomaly Benchmark(NAB)](\n",
    "https://www.kaggle.com/boltzmannbrain/nab) dataset. It provides artifical\n",
    "timeseries data containing labeled anomalous periods of behavior. Data are\n",
    "ordered, timestamped, single-valued metrics.\n",
    "\n",
    "We will use the `art_daily_small_noise.csv` file for training and the\n",
    "`art_daily_jumpsup.csv` file for testing. The simplicity of this dataset\n",
    "allows us to demonstrate anomaly detection effectively.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "master_url_root = \"https://raw.githubusercontent.com/numenta/NAB/master/data/\"\n",
    "\n",
    "df_small_noise_url_suffix = \"artificialNoAnomaly/art_daily_small_noise.csv\"\n",
    "df_small_noise_url = master_url_root + df_small_noise_url_suffix\n",
    "df_small_noise = pd.read_csv(df_small_noise_url)\n",
    "\n",
    "df_daily_jumpsup_url_suffix = \"artificialWithAnomaly/art_daily_jumpsup.csv\"\n",
    "df_daily_jumpsup_url = master_url_root + df_daily_jumpsup_url_suffix\n",
    "df_daily_jumpsup = pd.read_csv(df_daily_jumpsup_url)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## Quick look at the data\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "print(df_small_noise.head())\n",
    "\n",
    "print(df_daily_jumpsup.head())\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## Visualize the data\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "\n",
    "def plot_dates_values(data):\n",
    "    dates = data[\"timestamp\"].to_list()\n",
    "    values = data[\"value\"].to_list()\n",
    "    dates = [datetime.strptime(x, \"%Y-%m-%d %H:%M:%S\") for x in dates]\n",
    "    plt.subplots_adjust(bottom=0.2)\n",
    "    plt.xticks(rotation=25)\n",
    "    ax = plt.gca()\n",
    "    xfmt = md.DateFormatter(\"%Y-%m-%d %H:%M:%S\")\n",
    "    ax.xaxis.set_major_formatter(xfmt)\n",
    "    plt.plot(dates, values)\n",
    "    plt.show()\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "### Timeseries data without anomalies\n",
    "\n",
    "We will use the following data for training.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "plot_dates_values(df_small_noise)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "### Timeseries data with anomalies\n",
    "\n",
    "We will use the following data for testing and see if the sudden jump up in the\n",
    "data is detected as an anomaly.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "plot_dates_values(df_daily_jumpsup)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## Prepare training data\n",
    "\n",
    "Get data values from the training timeseries data file and normalize the\n",
    "`value` data. We have a `value` for every 5 mins for 14 days.\n",
    "\n",
    "-   24 * 60 / 5 = **288 timesteps per day**\n",
    "-   288 * 14 = **4032 data points** in total\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "\n",
    "def get_value_from_df(df):\n",
    "    return df.value.to_list()\n",
    "\n",
    "\n",
    "def normalize(values):\n",
    "    mean = np.mean(values)\n",
    "    values -= mean\n",
    "    std = np.std(values)\n",
    "    values /= std\n",
    "    return values, mean, std\n",
    "\n",
    "\n",
    "# Get the `value` column from the training dataframe.\n",
    "training_value = get_value_from_df(df_small_noise)\n",
    "\n",
    "# Normalize `value` and save the mean and std we get,\n",
    "# for normalizing test data.\n",
    "training_value, training_mean, training_std = normalize(training_value)\n",
    "len(training_value)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "### Create sequences\n",
    "Create sequences combining `TIME_STEPS` contiguous data values from the\n",
    "training data.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "TIME_STEPS = 288\n",
    "\n",
    "\n",
    "def create_sequences(values, time_steps=TIME_STEPS):\n",
    "    output = []\n",
    "    for i in range(len(values) - time_steps):\n",
    "        output.append(values[i : (i + time_steps)])\n",
    "    # Convert 2D sequences into 3D as we will be feeding this into\n",
    "    # a convolutional layer.\n",
    "    return np.expand_dims(output, axis=2)\n",
    "\n",
    "\n",
    "x_train = create_sequences(training_value)\n",
    "print(\"Training input shape: \", x_train.shape)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## Build a model\n",
    "\n",
    "We will build a convolutional reconstruction autoencoder model. The model will\n",
    "take input of shape `(batch_size, sequence_length, num_features)` and return\n",
    "output of the same shape. In this case, `sequence_length` is 288 and\n",
    "`num_features` is 1.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "model = keras.Sequential(\n",
    "    [\n",
    "        layers.Input(shape=(x_train.shape[1], x_train.shape[2])),\n",
    "        layers.Conv1D(\n",
    "            filters=32, kernel_size=7, padding=\"same\", strides=2, activation=\"relu\"\n",
    "        ),\n",
    "        layers.Dropout(rate=0.2),\n",
    "        layers.Conv1D(\n",
    "            filters=16, kernel_size=7, padding=\"same\", strides=2, activation=\"relu\"\n",
    "        ),\n",
    "        layers.Conv1DTranspose(\n",
    "            filters=16, kernel_size=7, padding=\"same\", strides=2, activation=\"relu\"\n",
    "        ),\n",
    "        layers.Dropout(rate=0.2),\n",
    "        layers.Conv1DTranspose(\n",
    "            filters=32, kernel_size=7, padding=\"same\", strides=2, activation=\"relu\"\n",
    "        ),\n",
    "        layers.Conv1DTranspose(filters=1, kernel_size=7, padding=\"same\"),\n",
    "    ]\n",
    ")\n",
    "model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001), loss=\"mse\")\n",
    "model.summary()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## Train the model\n",
    "\n",
    "Please note that we are using `x_train` as both the input and the target\n",
    "since this is a reconstruction model.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "history = model.fit(\n",
    "    x_train,\n",
    "    x_train,\n",
    "    epochs=50,\n",
    "    batch_size=128,\n",
    "    validation_split=0.1,\n",
    "    callbacks=[\n",
    "        keras.callbacks.EarlyStopping(monitor=\"val_loss\", patience=5, mode=\"min\")\n",
    "    ],\n",
    ")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "Let's plot training and validation loss to see how the training went.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "plt.plot(history.history[\"loss\"], label=\"Training Loss\")\n",
    "plt.plot(history.history[\"val_loss\"], label=\"Validation Loss\")\n",
    "plt.legend()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## Detecting anomalies\n",
    "\n",
    "We will detect anomalies by determining how well our model can reconstruct\n",
    "the input data.\n",
    "\n",
    "\n",
    "1.   Find MAE loss on training samples.\n",
    "2.   Find max MAE loss value. This is the worst our model has performed trying\n",
    "to reconstruct a sample. We will make this the `threshold` for anomaly\n",
    "detection.\n",
    "3.   If the reconstruction loss for a sample is greater than this `threshold`\n",
    "value then we can infer that the model is seeing a pattern that it isn't\n",
    "familiar with. We will label this sample as an `anomaly`.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "# Get train MAE loss.\n",
    "x_train_pred = model.predict(x_train)\n",
    "train_mae_loss = np.mean(np.abs(x_train_pred - x_train), axis=1)\n",
    "\n",
    "plt.hist(train_mae_loss, bins=50)\n",
    "plt.xlabel(\"Train MAE loss\")\n",
    "plt.ylabel(\"No of samples\")\n",
    "plt.show()\n",
    "\n",
    "# Get reconstruction loss threshold.\n",
    "threshold = np.max(train_mae_loss)\n",
    "print(\"Reconstruction error threshold: \", threshold)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "### Compare recontruction\n",
    "\n",
    "Just for fun, let's see how our model has recontructed the first sample.\n",
    "This is the 288 timesteps from day 1 of our training dataset.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "# Checking how the first sequence is learnt\n",
    "plt.plot(x_train[0])\n",
    "plt.show()\n",
    "plt.plot(x_train_pred[0])\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "### Prepare test data\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "\n",
    "def normalize_test(values, mean, std):\n",
    "    values -= mean\n",
    "    values /= std\n",
    "    return values\n",
    "\n",
    "\n",
    "test_value = get_value_from_df(df_daily_jumpsup)\n",
    "test_value = normalize_test(test_value, training_mean, training_std)\n",
    "plt.plot(test_value.tolist())\n",
    "plt.show()\n",
    "\n",
    "# Create sequences from test values.\n",
    "x_test = create_sequences(test_value)\n",
    "print(\"Test input shape: \", x_test.shape)\n",
    "\n",
    "# Get test MAE loss.\n",
    "x_test_pred = model.predict(x_test)\n",
    "test_mae_loss = np.mean(np.abs(x_test_pred - x_test), axis=1)\n",
    "test_mae_loss = test_mae_loss.reshape((-1))\n",
    "\n",
    "plt.hist(test_mae_loss, bins=50)\n",
    "plt.xlabel(\"test MAE loss\")\n",
    "plt.ylabel(\"No of samples\")\n",
    "plt.show()\n",
    "\n",
    "# Detect all the samples which are anomalies.\n",
    "anomalies = (test_mae_loss > threshold).tolist()\n",
    "print(\"Number of anomaly samples: \", np.sum(anomalies))\n",
    "print(\"Indices of anomaly samples: \", np.where(anomalies))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## Plot anomalies\n",
    "\n",
    "We now know the samples of the data which are anomalies. With this, we will\n",
    "find the corresponding `timestamps` from the original test data. We will be\n",
    "using the following method to do that:\n",
    "\n",
    "Let's say time_steps = 3 and we have 10 training values. Our `x_train` will\n",
    "look like this:\n",
    "\n",
    "- 0, 1, 2\n",
    "- 1, 2, 3\n",
    "- 2, 3, 4\n",
    "- 3, 4, 5\n",
    "- 4, 5, 6\n",
    "- 5, 6, 7\n",
    "- 6, 7, 8\n",
    "- 7, 8, 9\n",
    "\n",
    "All except the initial and the final time_steps-1 data values, will appear in\n",
    "`time_steps` number of samples. So, if we know that the samples\n",
    "[(3, 4, 5), (4, 5, 6), (5, 6, 7)] are anomalies, we can say that the data point\n",
    "5 is an anomaly.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "# data i is an anomaly if samples [(i - timesteps + 1) to (i)] are anomalies\n",
    "anomalous_data_indices = []\n",
    "for data_idx in range(TIME_STEPS - 1, len(test_value) - TIME_STEPS + 1):\n",
    "    time_series = range(data_idx - TIME_STEPS + 1, data_idx)\n",
    "    if all([anomalies[j] for j in time_series]):\n",
    "        anomalous_data_indices.append(data_idx)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "Let's overlay the anomalies on the original test data plot.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "df_subset = df_daily_jumpsup.iloc[anomalous_data_indices, :]\n",
    "plt.subplots_adjust(bottom=0.2)\n",
    "plt.xticks(rotation=25)\n",
    "ax = plt.gca()\n",
    "xfmt = md.DateFormatter(\"%Y-%m-%d %H:%M:%S\")\n",
    "ax.xaxis.set_major_formatter(xfmt)\n",
    "\n",
    "dates = df_daily_jumpsup[\"timestamp\"].to_list()\n",
    "dates = [datetime.strptime(x, \"%Y-%m-%d %H:%M:%S\") for x in dates]\n",
    "values = df_daily_jumpsup[\"value\"].to_list()\n",
    "plt.plot(dates, values, label=\"test data\")\n",
    "\n",
    "dates = df_subset[\"timestamp\"].to_list()\n",
    "dates = [datetime.strptime(x, \"%Y-%m-%d %H:%M:%S\") for x in dates]\n",
    "values = df_subset[\"value\"].to_list()\n",
    "plt.plot(dates, values, label=\"anomalies\", color=\"r\")\n",
    "\n",
    "plt.legend()\n",
    "plt.show()\n"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "timeseries_anomaly_detection",
   "private_outputs": false,
   "provenance": [],
   "toc_visible": true
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}